

#### Thoughts on Commercial Off the Shelf (COTS) Electronics for Space

or

#### There's No Radiation Hardened Radio Shack<sup>™</sup> on the Moon

Kenneth A. LaBel, ken.label@nasa.gov, 301-286-9936

Michael J. Sampson, michael.j.sampson@nasa.gov, 301-614-6233

**Co-Managers, NEPP Program**

NASA/GSFC

http://nepp.nasa.gov

Unclassified



#### Outline

- Background
- Qualification vs. Screening
- Risk Trade Space
- Radiation Effects Perspective
- Higher Assembly Levels?
- Summary



Hubble Space Telescope courtesy NASA



#### **Assurance for Electronic Devices**

#### Assurance is

- Knowledge of
  - The supply chain and manufacturer of the product,
  - The manufacturing process and its controls, and,
  - The physics of failure (POF) related to the technology.
- Statistical process and inspection via
  - Testing, inspection, physical analyses and modeling.
- Understanding the application and environmental conditions for device usage.
  - This includes:
    - Radiation,
    - Lifetime,
    - Temperature,
    - Vacuum, etc., as well as,
    - Device application and appropriate derating criteria.



## **NASA and COTS**

- NASA has been a user of COTS electronics for decades, typically when
  - Mil/Aero alternatives are not available (performance or function or procurement schedule),
  - A system can assume possible unknown risks, and,
  - A mission has a relatively short lifetime or benign space environment exposure.
- In most cases, some form of "upscreening\*" has occurred.
  - A means of measuring a portion of the inherent reliability of a device.
  - Discovering that a COTS device fails upscreening has occurred in almost every flight program.

\*Upscreening: performing tests/analysis on electronic parts for environments outside the intended/guaranteed range of a device



## **Reliability and Availability**

- Reliability (Wikipedia)
  - The ability of a system or component to perform its required functions under stated conditions for a specified period of time.
- Availability (Wikipedia)
  - The degree to which a system, subsystem, or equipment is in a specified operable and committable state at the start of a mission, when the mission is called for at an unknown, *i.e.*, a random, time. Simply put, availability is the proportion of time a system is in a functioning condition. This is often described as a *mission capable rate*.
- The question is:
  - Does it HAVE to work? Or
  - Do you just WANT it to work?
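As a rough illustration of the "proportion of time" reading of availability quoted above, the sketch below computes availability from observed uptime and downtime. The function name and the sample hours are hypothetical, chosen only for illustration.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of total observed time the system is in a functioning condition."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total


# Hypothetical example: 950 hours working, 50 hours down.
print(f"{availability(950.0, 50.0):.2%}")
```

A "have to work" system would typically set a floor on this number over the mission lifetime; a "want to work" system may simply report it.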

# What does this mean for EEE parts?

- The more understanding you have of a device's failure modes and causes, the higher the confidence level that it will perform under mission environments and lifetime
  - High confidence = "have to work"
    - The key is operating without a problem when you need it to (appropriate availability over the mission lifetime)
  - Less confidence = "want to work"
    - This is not saying that it won't work, just that our confidence that it will be available isn't as high (or is unknown)
- Qualification processes are statistical beasts designed to understand/remove known reliability risks and uncover unknown risks inherent in a part.
  - Requires significant sample size and comprehensive suite of piecepart testing (insight)



## **Screening ≠ Qualification**

- Electronic component screening uses environmental stressing and electrical testing to identify marginal and defective components within a "lot" of devices.
  - This is opposed to qualification, which is usually a suite of harsher (and often destructive) tests intended to fully determine the reliability characteristics of the device over a standard environment/application range
- Diatribe: what is a "lot"?
  - For the Mil/Aero system, it is devices that come from the same wafer diffusion (i.e., silicon lot from the same wafer)
  - For all others, it is usually the same "packaging" date
    - I.e., silicon may or may not be the same, but the devices were packaged at the same time. This raises a concern often known as "die traceability".
  - Device failure modes often have variance from silicon lot to silicon lot.

#### Why COTS? The Growth in Integrated Circuit Availability

- The semiconductor industry has seen an explosion in the types and complexity of devices that are available over the last several decades
  - The commercial market drives features
    - High density (memories)
    - High performance (processors)
    - Upgrade capability and time-to-market
      - Field Programmable Gate Arrays (FPGAs)
    - Wireless (Radio Frequency (RF) and mixed signal)
    - Long battery life (low-power Complementary Metal Oxide Semiconductor (CMOS))



Zilog Z80 Processor, circa 1978: 8-bit processor

Intel 65nm Dual Core Pentium D Processor, circa 2007: dual 64-bit processors

FPGA: field programmable gate array; RF: radio frequency; CMOS: complementary metal oxide semiconductor





## **The Changes in Device Technology**

Besides increased availability, many changes have taken place in

- Base technology,
- Device features, and,
- Packaging

The table below highlights a few selected changes.

| Feature                                  | circa 1990          | circa 2007                                                        |
|------------------------------------------|---------------------|-------------------------------------------------------------------|
| Base technology                          | bulk CMOS/NMOS      | CMOS with strained Si or SOI                                      |
| Feature size                             | > 2.0 µm            | 65 nm                                                             |
| Memory size, volatile (device)           | 256 kb              | 1 Gb                                                              |
| Processor speed                          | 64 MHz              | > 3 GHz                                                           |
| FPGA gates                               | 2k                  | > 1M                                                              |
| Package                                  | DIP or LCC, 40 pins | FCBGA, 1500 balls                                                 |
| Advanced system on a chip (SOC) features | Cache memory        | > Gbps serial link, SerDes, embedded processors, embedded memory  |

DIP: dual in-line package; LCC: leaded chip carrier; FCBGA: flip chip ball grid array; SOI: silicon on insulator

Now commercial technology is pushing towards 14 nm, 3D transistors, substrates, etc.

# The Challenge for Selecting ICs for Space

- Considerations since the "old days"
  - High reliability (and radiation tolerant) devices
      - Now a very small market percentage
  - Commercial "upscreening"
    - Increasing in importance
    - Measures reliability; does not enhance it
  - System level performance and risk
    - Hardened or fault tolerant "systems" not devices

ADC: analog-to-digital converter; SDRAM: synchronous dynamic random access memory; SerDes: serializer-deserializer; ASIC: application-specific integrated circuit; DSP: digital signal processor



System Designer Trying to meet high-resolution instrument requirements AND long-life



## The Trade Space Involved With Part Selection

- Evolution of IC space procurement philosophy
  - OLD: Buy Radiation Hardened Devices Only
  - NEW: Develop Radiation Tolerant Systems
- This is now systems design that involves a risk management approach that is often quite complex.
- For the purposes of this discussion, we shall divide ICs into two basic categories
  - Space-qualified which may or may not be radiation hardened, and,
  - Commercial
- Understanding Risk and the Trade Space involved with these devices is the new key to mission success
  - Think size, weight, and power (SWaP), for instance



Performance Inside an Apple iPhone™



### **IC Selection Requirements**

- To begin the discussion, we shall review IC selection from three distinct and often contrary perspectives
  - Performance,
  - Programmatic, and,
  - Reliability.
- Each of these will be considered in turn; however, one must ponder all aspects as part of the process





#### **Performance Requirements**

- Rationale
  - Trying to meet science, surveillance, or other performance requirements
- Personnel involved
  - Electrical designer, systems engineer, other engineers
- Usual method of requirements
  - Flowdown from science or similar requirements to implementation
    - E.g., ADC resolution or speed, data storage size, etc.
- Buzzwords
  - MIPS/watt, Gbytes/cm<sup>3</sup>, resolution, MHz/GHz, reprogrammable
- Limiting technical factors beyond electrical
  - Size, weight, and power (SWaP)

To be presented by Ken LaBel at the NASA Electronic Parts and Packaging Program (NEPP) Electronics Technology Workshop (ETW), NASA Goddard Space Flight Center in Greenbelt, MD, June 11-12, 2013 and published on nepp.nasa.gov.



MIPS: millions of instructions per second



#### Programmatic Requirements and Considerations

- Rationale
  - Trying to keep a program on schedule and within budget
- Personnel involved
  - Project manager, resource analyst, system scheduler
- Usual method of requirements
  - Flowdown from parent organization or mission goals for budget/schedule
    - E.g., launch date
- Buzzwords
  - Cost cap, GANTT/PERT chart, risk matrix, contingency
- Limiting factors
  - Parent organization makes final decision




Programmatics: a numbers game



## **Risk Requirements**

- Rationale
  - Trying to ensure mission parameters such as reliability, availability, operate-through, and lifetime are met
- Personnel involved
  - Radiation engineer, reliability engineer, parts engineer
- Usual method of requirements
  - Flowdown from mission requirements for parameter space
    - E.g., the single-event upset (SEU) rate for a system, derived from the system availability specification
- Buzzwords
  - Lifetime, total dose, single events, device screening, "waivers"
- Limiting factors
  - Management normally makes "acceptable" risk decision





### **Understanding Risk**

- Risk management may be broken into three considerations
  - Technical/Design "The Good"
    - Relates to circuit designs not being able to meet mission criteria, such as jitter related to a long dwell time of a telescope on an object
  - Programmatic "The Bad"
    - Relates to a mission missing a launch window or exceeding a budgetary cost cap, which can lead to mission cancellation
  - Radiation/Reliability "The Ugly"
    - Relates to a mission meeting its lifetime and performance goals without premature failures or unexpected anomalies
- Each mission must determine its priorities among the three risk types






## The Risk Trade Space: Considerations for Device Selection (Incomplete)

- Cost and Schedule
  - Procurement
  - NRE
  - Maintenance
  - Qualification and test
- Performance
  - Bandwidth/density
  - SWaP
  - System function and criticality
  - Other mission constraints (e.g., reconfigurability)
- System Complexity
  - Secondary ICs (and all their associated challenges)
  - Software, etc...

- Design Environment and Tools
  - Existing infrastructure and heritage
  - Simulation tools
- System operating factors
  - Operate-through for single events
  - Survival-through for portions of the natural environment
  - Data operation (e.g., 95% data coverage)
- Radiation and Reliability
  - SEE rates
  - Lifetime (TID, thermal, reliability,...)
  - "Upscreening"
- System Validation and Verification

NRE: non-recurring engineering; IC: integrated circuit; SEE: single-event effect; TID: total ionizing dose



## **Systems Engineering and Risk**

- The determination of acceptability for device usage is a complex trade space
  - Every engineer will "solve" a problem differently
    - Approaches such as synchronous design may be the same, but exact implementations are never the same
- A more omnidirectional approach is taken weighing the various risks
  - Each of the three factors may be assigned weighted priorities
    - The systems engineer is often the "person in the middle" evaluating the technical/reliability risks and working with management to determine acceptable risk levels



### **Traditional Risk Matrix**





#### An Example "Ad hoc" Battle

- Mission requirement: High resolution image
  - Flowdown requirement: 14-bit 100 Msps ADC
    - Usually more detailed requirements are used as well, such as Effective Number of Bits (ENOB), Integral Non-Linearity (INL), or Differential Non-Linearity (DNL)
  - Designer
    - Searches for available radiation hardened ADCs that meet the requirement
    - Searches for commercial alternatives that could be upscreened
    - Looks at fault tolerant architecture options
  - Manager
    - Trades the cost of buying a Mil/Aero part, which requires less aftermarket testing, against that of a purely commercial IC
    - Worries over delivery and test schedule of the candidate devices
  - Radiation/Parts Engineer
    - Evaluates existing device data to determine reliability performance and additional test cost and schedule
- The best device? Depends on mission priorities



## Radiation Perspective on IC Selection

- From the radiation perspective, ICs can be viewed as one of four categories.
  - Guaranteed hardness
    - Radiation-hardened by process (RHBP)
    - Radiation-hardened by design (RHBD)
  - Historical ground-based radiation data
    - Lot acceptance criteria
  - Historical flight usage
    - Statistical significance
  - Unknown assurance
    - New device or one with no data or guarantee



#### RHBD Voting Approach

http://www.aero.org/publications/crosslink/summer2003/06.html
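The figure for this slide is not reproduced here. Assuming the voting approach it refers to is triple redundancy with a majority voter (a common RHBD technique), the idea can be sketched in software as a bitwise 2-of-3 vote. In a real device the voter is built from logic gates; the function name and the sample values below are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit matches at least two inputs."""
    return (a & b) | (a & c) | (b & c)


# Three redundant copies of an 8-bit value; one copy takes a single bit flip.
stored = 0b1011_0010
copies = [stored, stored ^ 0b0000_1000, stored]  # hypothetical upset in the second copy

voted = majority_vote(*copies)
print(voted == stored)
```

A single upset in any one copy is outvoted by the other two; two upsets in the same bit position of different copies would defeat the vote.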



# **"Guaranteed" Radiation Tolerance**

- A limited number of semiconductor manufacturers, either with fabs or fabless, will guarantee radiation performance of devices
  - Examples:
    - ATMEL, Honeywell, BAE Systems, Aeroflex
  - Radiation qualification usually is performed on either
    - Qualification test vehicle,
    - Device type or family member, or
    - Lot qualification
  - Some vendors sell "guaranteed" radiation tolerant devices by "cherry-picking" commercial devices coupled with mitigation approaches external to the die
- The devices themselves can be hardened via
  - Process or material (RHBP or RHBM),
  - Design (RHBD), or
  - Serendipity (RHBS)

Most radiation tolerant foundries use a mix of hardening approaches



#### Archival Radiation Performance – Ground-based Data

- Reviewing existing ground radiation test data on an IC and its application has been discussed previously
  - For example, by Christian Poivey at the NSREC Short Course in 2002
  - Using a "similar" device with data is risky, but sometimes considered (though not recommended)
- In general, the flow is shown below

NSREC: Nuclear and Space Radiation Effects Conference





#### Archival Radiation Performance – Flight Heritage

- Can we make use of parts with flight heritage and no ground data for a new mission?
- A flow similar to that for archival ground data exists, but consider as well
  - Statistical significance of the flight data
    - Environment severity?
    - Number of samples?
    - Length of mission?
  - Has storage of devices affected radiation tolerance or reliability?
  - And so forth
- This approach is rarely recommended by the radiation experts



Some heritage designs last better than others

# ICs with No Guarantee or Heritage

- Radiation testing is required in the vast majority of cases
  - Testing complexities and challenges are discussed elsewhere
  - The true challenge is to gather sufficient data in a cost and schedule effective manner.
    - A backup plan should be made in case the device fails to pass radiation criteria.
- Reliability testing has similar concerns

**FPGA-based motherboard** 



**SDRAM** mounted on a daughtercard

#### "Abandon all hope, ye who enter here"



## **Is Testing Always Required?**

- Exceptions to testing may include
  - Operational
    - E.g., the device is only powered on once per orbit and the sensitive time window for a single-event effect is minimal
  - Acceptable data loss
    - E.g., the system-level error rate may be set such that data is gathered 95% of the time. This is data availability. Given the physical device volume and assuming every ion causes an upset, this worst-case rate may be tractable.
  - Negligible effect
    - E.g., a 2-week mission on a shuttle may have a very low total ionizing dose (TID) requirement. TID testing could be waived.
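The "acceptable data loss" exception can be pictured as a bounding calculation: assume every ion that reaches the device causes an upset and check whether the resulting data loss still meets the coverage goal. The sketch below is a simplification under stated assumptions. It uses a sensitive area rather than the device volume, charges one full recovery period per upset, and every numeric input is hypothetical rather than a measured value.

```python
def worst_case_data_loss(flux_per_cm2_s: float, area_cm2: float,
                         recovery_s: float) -> float:
    """Upper bound on the fraction of time lost, assuming every ion upsets."""
    upsets_per_s = flux_per_cm2_s * area_cm2
    # Each upset is assumed to cost one full recovery period of lost data.
    return min(1.0, upsets_per_s * recovery_s)


# Hypothetical inputs, for illustration only (not measured values).
flux = 1e-3          # ions / cm^2 / s reaching the device
area = 1.0           # sensitive area in cm^2
recovery = 10.0      # seconds of data lost per upset

loss = worst_case_data_loss(flux, area, recovery)
meets_95_percent = (1.0 - loss) >= 0.95
print(loss, meets_95_percent)
```

If even this pessimistic bound leaves data availability above the requirement, a case for waiving single-event testing can at least be argued; if not, testing is needed to replace the worst-case assumption with data.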



A flash memory may be acceptable without testing if a low TID requirement exists or if it is not powered on for the large majority of the time.



Evaluation Method of Commercial Off-the-Shelf (COTS) Electronic Printed Circuit Boards (PCBs) or Assemblies



#### We can test devices, but how do we test systems?





# Challenges with COTS PCBs include:

- The inability to trace die heritage or in some cases lack of information on "datasheets"
- The limited testability of printed circuit boards (PCBs) due to complex circuitry and packaging issues ("visibility" issues)
- The issue of piecepart versus board level tests
  - Board performance being monitored, not device
  - Error/fault propagation often time dependent
- The possibility of "board-to-board" IC variances for "copies" of the "same" PCBs
  - Lot-to-lot, device-to-device
- The ability to simulate the space radiation environment with a single particle test
- Limited parts list information
  - Bill-of-materials often does NOT include lot date codes or device manufacturer information
- Statistics are often limited
  - It's easier to purchase and test 10 devices than 10 PCBs (cost and schedule), thus the number of test samples is reduced
  - Parts "variability"



#### **Summary**

- In this talk, we have presented considerations for selection of ICs for space systems
  - Technical, programmatic, and risk-oriented
    - As noted, every mission may view the relative priorities between the considerations differently
- As seen below, every decision type may have a process.
  - It's all in developing an appropriate one for your application.



#### **Five stages of Consumer Behavior**

http://www-rohan.sdsu.edu/~renglish/370/notes/chapt05/



# BACKUP



#### Estimated Test/Parts Costs for Complex Device Normalized to FY98





#### Disclaimer: Statistics and "Qualification"



Commercial 1 Gb SDRAM

- 68 operating modes
- Can operate to > 500 MHz
- Vdd 2.5 V external, 1.25 V internal

#### **Single Event Effect Test Matrix**

#### full generic testing

| Amount | Item                                |
|--------|-------------------------------------|
| 3      | Number of Samples                   |
| 68     | Modes of Operation                  |
| 4      | Test Patterns                       |
| 3      | Frequencies of Operation            |
| 3      | Power Supply Voltages               |
| 3      | Ions                                |
| 3      | Hours per Ion per Test Matrix Point |

| Total test time | Unit  |
|-----------------|-------|
| 66096           | Hours |
| 2754            | Days  |
| 7.54            | Years |

Doesn't include temperature variations!!!
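The totals above are simply the product of the test matrix factors. The short sketch below reproduces the arithmetic; the dictionary keys are illustrative names, and the conversion assumes round-the-clock testing and 365.25-day years, which is what yields the 7.54 figure.

```python
from math import prod

# Factors copied from the single event effect test matrix.
test_matrix = {
    "samples": 3,
    "modes_of_operation": 68,
    "test_patterns": 4,
    "frequencies": 3,
    "supply_voltages": 3,
    "ions": 3,
    "hours_per_ion_per_point": 3,
}

total_hours = prod(test_matrix.values())
total_days = total_hours / 24          # assumes testing 24 hours a day
total_years = total_days / 365.25      # assumed year length

print(total_hours, total_days, round(total_years, 2))
```

Adding a temperature axis with, say, three set points would triple every total, which is why full generic testing of a complex device is impractical.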

The more complex a device, the more application-specific the test results